ABSTRACT

DR_bind is a web server that automatically predicts 
DNA-binding residues, given the respective protein 
structure based on (i) electrostatics, (ii) evolution 
and (iii) geometry. In contrast to machine-learning 
methods, DR_bind does not require a training data 
set or any parameters. It predicts DNA-binding 
residues by detecting a cluster of conserved, 
solvent-accessible residues that are electrostatically
stabilized upon mutation to Asp⁻/Glu⁻.
The server requires as input the DNA-binding 
protein structure in PDB format and outputs a 
downloadable text file of the predicted DNA- 
binding residues, a 3D visualization of the predicted 
residues highlighted in the given protein structure, 
and a downloadable PyMol script for visualization of 
the results. Calibration on 83 and 55 non-redundant 
DNA-bound and DNA-free protein structures yielded 
a DNA-binding residue prediction accuracy/preci- 
sion of 90/47% and 88/42%, respectively. Since 
DR_bind does not require any training using 
protein-DNA complex structures, it may predict 
DNA-binding residues in novel structures of 
DNA-binding proteins resulting from structural 
genomics projects with no conservation data. 
The DR_bind server is freely available with no
login requirement at http://dnasite.limlab.ibms.sinica.edu.tw.

INTRODUCTION 

Interactions between proteins and DNA play essential 
roles for life. For example, protein-DNA interactions 
control gene regulation, cell replication and transcription,
as well as DNA repair. Furthermore, many of these
DNA-binding proteins are involved in human diseases 
such as neurological disorders, e.g. TDP-43 (1), and
cancer, e.g. p53 (2). Consequently, identifying the key
amino acid residues involved in DNA recognition is 
critical for understanding these important biological 
processes. It also guides which residues to mutate in 
experimental studies. 

Several methods and web servers have been developed 
to predict DNA-binding residues from the protein 1D
sequence or 3D structure. Methods that predict DNA- 
binding residues using only the protein sequence generally 
employ machine-learning algorithms such as a neural 
network (3-5), a Naive Bayes classifier (6), a support 
vector machine (7-12), random forest (13,14), or 
decision trees (C4.5 algorithm) (15). These algorithms 
usually employ amino acid physicochemical properties, 
sequence conservation, the local sequence context, 
solvent accessibility and/or secondary structure. Publicly 
available web servers that implement sequence-based 
methods for predicting DNA-binding residues include 
DBS-PRED (3), DBS-PSSM (5), DNABindR (6), 
DP-Bind (8), DISIS (9), BindN-rf (14), BindN+ (12), 
NAPS (15) and MetaDBSite (16). Methods that use the 
protein structure, if available, generally improve the 
DNA-binding site prediction, as they replace the predicted 
solvent accessibility, hydrophobicity and secondary struc- 
ture in sequence-based methods with observed ones and 
can additionally employ energies or frequencies, computed 
from the atomic coordinates, as well as experimental 
geometrical features. Structure-based methods for predict- 
ing DNA-binding residues employ mostly electrostatic po- 
tentials in conjunction with other features such as surface/ 
solvent accessibility, the protein surface shape, amino acid 
conservation, propensity, hydrophobicity and hydrogen- 
bonding potential and structural motifs (17-22), or 
high-frequency residue fluctuations (23). Servers that 
implement structure-based methods for predicting DNA-
binding residues include PreDs (24), DISPLAR (25), 
DBD-Hunter (26) and DNABINDPROT (23). 

In our previous work (27), we had developed a 
structure-based DNA-binding residue prediction method 
based on (i) electrostatics, (ii) conservation and (iii) 
geometry with the following rationale: (i) DNA-binding 
residues contain electropositive atoms, which would 
be in an unfavorable electrostatic environment in the 
absence of DNA or water; thus replacing one of these 
residues with a negatively charged Asp⁻/Glu⁻ would alleviate
the electrostatic repulsion among the electropositive
atoms in the gas phase; (ii) DNA-binding residues and 
residues in the vicinity, which form a cluster of spatially 
interacting residues, are usually highly conserved within 
the same family due to their critical functional roles and 
(iii) DNA-binding residues have been observed to be 
located on surface patches, as opposed to clefts/cavities 
for RNA-binding residues and enzyme substrates. In this 
work, we have implemented our DNA-binding residue prediction
method for public use in a web server, DR_bind
(http://dnasite.limlab.ibms.sinica.edu.tw). Whereas our published
method for predicting DNA-binding sites had been tested 
on a non-redundant set of 56 DNA-bound and 23 DNA- 
free non-homologous protein structures (27), DR_bind 
was tested herein using an updated non-redundant set 
of 83 DNA-bound and 55 DNA-free structures (referred 
to as Data sets I and II, respectively). DR_bind was 
also tested using a protein-DNA docking benchmark con- 
taining 47 unbound-bound structures (28) and 15 
non-redundant DNA-bound protein structures with no
or insufficient homologous sequences to compute conservation
scores reliably. In contrast to current DNA-binding
residue prediction servers, DR_bind is based on physical
principles of binding thermodynamics (29) and does not
require training on a set of protein-DNA complexes or
any parameters. Hence, DR_bind would be a timely
addition, since the number of solved structures of
DNA-binding proteins has been rising rapidly.

METHODS 

Data sets used 

DR_bind was tested using four data sets: I — 83 
non-redundant DNA-bound protein structures, II — 55 
non-redundant DNA-free protein structures, III — 47 
bound-unbound structures from the protein-DNA bench- 
mark version 1.2 (28) and IV — 15 non-redundant 
DNA-bound protein structures with no, or insufficient 
homologs to compute conservation profiles reliably. To 
create Data set I, all available X-ray structures of 
DNA-bound proteins solved to <3 Å resolution were
obtained from the current Protein Data Bank (PDB) 
(30). These protein chains were grouped according to 
their Class, Architecture, Topology and Homologous 
superfamily (CATH) codes (31). For each group of 
protein structures with the same CATH code, the structure
with the best resolution was selected as the representative
one. If any of these representative proteins shared
>30% sequence identity, the protein with the longer
sequence was kept, while the others were discarded. This
yielded 83 DNA-bound proteins that are sequentially and
structurally non-homologous with conservation data
(Supplementary Table S1), whereas the remaining 12
proteins had no conservation profiles from the ConSurf- 
DB database (http://consurfdb.tau.ac.il/) (32). 
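
The two-stage filtering described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`id`, `cath`, `resolution`, `length`) and the `identity` callback are hypothetical, and the greedy longest-first pass is one assumed way of resolving pairs that share >30% sequence identity.

```python
from collections import defaultdict


def pick_representatives(chains, identity, max_identity=30.0):
    """Select non-redundant chains (illustrative sketch).

    chains:   list of dicts with hypothetical keys
              'id', 'cath', 'resolution' (in Angstrom) and 'length'.
    identity: hypothetical callback (id_a, id_b) -> percent sequence identity.
    """
    # Stage 1: keep the best-resolution structure for each CATH code.
    groups = defaultdict(list)
    for chain in chains:
        groups[chain["cath"]].append(chain)
    reps = [min(group, key=lambda c: c["resolution"]) for group in groups.values()]

    # Stage 2: visit representatives from longest to shortest and keep one
    # only if it shares <= max_identity with every chain already kept
    # (assumed greedy reading of "the longer sequence was kept").
    reps.sort(key=lambda c: c["length"], reverse=True)
    kept = []
    for chain in reps:
        if all(identity(chain["id"], k["id"]) <= max_identity for k in kept):
            kept.append(chain)
    return kept
```

In practice the identity values would come from a pairwise alignment program; here they are left to the caller.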

Data set II was derived from Data set I by searching 
each of the 83 DNA-bound proteins with conservation 
data for highly homologous proteins (sharing >90% 
sequence identity) with DNA-free structure(s) using the 
SAS tool (http://www.ebi.ac.uk/thornton-srv/databases/ 
sas/); if multiple DNA-free structures were found, the 
structure that showed the largest root-mean-square devi- 
ation (RMSD) from the DNA-bound structure using 
the SSAP program (33) was chosen as the representative 
one. This yielded 55 bound-unbound structures with a 
wide range of RMSDs (0.3-33 Å). The PDB entries of
the DNA-bound and free protein structures, the 
sequence identity between the DNA-bound and the re- 
spective free proteins computed using global alignment 
with ClustalW1.83 (34) and their RMSD values are 
given in Supplementary Table S1.

Data set III is a protein-DNA docking benchmark con- 
taining 47 bound-unbound structures, of which 13 were 
classified as 'easy', 22 as 'intermediate' and 12 as 'difficult' 
cases for docking depending on the interface RMSD 
values between the DNA-bound and corresponding free 
structures. 'Easy', 'intermediate' and 'difficult' structures 
were defined by interface RMSD values ranging from 0 to
2 Å, 2 to 5 Å and >5 Å, respectively. Data set III differs from
Data set II in that it includes: (i) protein structures 
deposited in the September 2007 RCSB PDB; (ii) structur- 
ally homologous proteins with the same CATH code; (iii) 
free NMR structures; and (iv) 15 structures without con- 
servation data from ConSurf-DB. 

To create Data set IV, the 12 proteins excluded from 
Data set I and the 15 proteins from the benchmark set, 
which lack conservation profiles from ConSurf-DB, were 
grouped according to their CATH codes. For each group 
of protein structures with the same CATH code, the best 
resolution structure was selected as the representative one. 
This yielded 15 non-redundant proteins sharing <30% 
pairwise sequence identity (Supplementary Table S2). 

Definitions 

A residue was considered to bind DNA if it contains one
or more non-hydrogen atoms within van der Waals contact
or hydrogen-bonding distance of a non-hydrogen atom
of its binding partner, directly or indirectly via a bridging
water molecule. HBPLUS (35) was used to compute all
possible hydrogen bonds and van der Waals contacts,
which are defined by a donor atom to acceptor atom
distance of <3.5 and <4.0 Å, respectively. An amino acid X
is considered accessible for interacting with DNA if the 
percent ratio of its side chain solvent-accessible surface 
area in the protein to that in the tripeptide, -Gly-X- 
Gly-, is >5% (17,36). MOLMOL (37) was used to 
compute the relative solvent-accessible surface area of 
each amino acid from the protein structure using a 
solvent probe radius of 1.4 Å.
Geometry

Since DNA-binding sites are found on the protein surface,
surface patches were generated by defining the Cα atom of
each residue as the origin of a patch and including in the
patch all residues whose Cα atoms were within 10 Å of the
origin. Non-identical patches with more than five
solvent-accessible residues were used in computing the
average electrostatic energy change and conservation
(see below).
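
The patch construction described above can be sketched in a few lines. The function and argument names are hypothetical, and the inputs (a mapping of residue identifiers to Cα coordinates and a set of solvent-accessible residues) are assumed to have been prepared beforehand; this is an illustration rather than the server's code.

```python
import math


def build_patches(ca_coords, accessible, radius=10.0, min_accessible=6):
    """Generate distinct surface patches (illustrative sketch).

    ca_coords:  dict mapping residue id -> (x, y, z) of its C-alpha atom.
    accessible: set of residue ids considered solvent accessible.
    """
    patches = set()  # a set of frozensets drops identical patches
    for origin_xyz in ca_coords.values():
        members = frozenset(
            res for res, xyz in ca_coords.items()
            if math.dist(origin_xyz, xyz) <= radius
        )
        # "More than five" accessible residues means at least six.
        if len(members & accessible) >= min_accessible:
            patches.add(members)
    return patches
```

Storing each patch as a `frozenset` makes the "non-identical" requirement automatic, since two origins that yield the same membership collapse into one entry.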

Electrostatics 

Given an $l$-residue DNA-binding protein structure, all Asp/
Glu residues were deprotonated, while Arg/Lys residues
were protonated; His residues were protonated or
deprotonated depending on the availability of hydrogen
bond acceptors in the structure. Next, $l$ mutant structures
were generated by replacing Ala, Asn, Asp, Cys, Gly, Ser,
Thr or Val in the wild-type structure with Asp⁻ and the
other residues with Glu⁻. The side chain replacements
were carried out using SCWRL (38), followed by energy
minimization with heavy constraints on all heavy atoms
using AMBER (39) to relieve any bad contacts. Based on
the wild-type/mutant structures, the gas-phase ($\varepsilon = 1$)
electrostatic energy of the wild-type ($E^{\mathrm{elec}}_{\mathrm{wt}}$) or mutant
($E^{\mathrm{elec}}_{\mathrm{mut}}$) protein in the 'folded' state relative to that in
an 'extended reference' state ($E'^{\mathrm{elec}}_{\mathrm{wt}}$ or $E'^{\mathrm{elec}}_{\mathrm{mut}}$) was
computed using AMBER (39) with the all-hydrogen-atom
AMBER force field (40). In this extended reference state,
the residues do not interact with one another; hence, the
electrostatic energy difference between the mutant
($E'^{\mathrm{elec}}_{\mathrm{mut}}$) and wild-type ($E'^{\mathrm{elec}}_{\mathrm{wt}}$) 'unfolded' proteins is equal
to the difference between the electrostatic energies of the
corresponding mutant Asp⁻/Glu⁻ ($E^{\mathrm{elec}}_{\mathrm{D/E}}$) and the
native residue at position $i$ ($E^{\mathrm{elec}}_{i}$). The change in the
gas-phase electrostatic energy, $\Delta\Delta E^{\mathrm{elec}}_{i}$, upon mutation
of residue $i$ to Asp⁻/Glu⁻ is given by:

$$\Delta\Delta E^{\mathrm{elec}}_{i} = \left(E^{\mathrm{elec}}_{\mathrm{mut}} - E^{\mathrm{elec}}_{\mathrm{wt}}\right) - \left(E^{\mathrm{elec}}_{\mathrm{D/E}} - E^{\mathrm{elec}}_{i}\right) \qquad (1)$$

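The average electrostatic energy change $\langle \Delta\Delta E^{\mathrm{elec}} \rangle_i$ of
the $N^{\mathrm{aa}}_{i}$ residues comprising surface patch $i$ was computed
from:

$$\langle \Delta\Delta E^{\mathrm{elec}} \rangle_i = \sum_{j} \Delta\Delta E^{\mathrm{elec}}_{j} / N^{\mathrm{aa}}_{i} \qquad (2)$$

where the summation in Equation (2) is over all residues $j$ in
patch $i$.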
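
As a small numerical illustration of Equations (1) and (2), the sketch below reads Equation (1) as the folded-state energy change on mutation minus the corresponding change in the extended reference state, and Equation (2) as a plain arithmetic mean over the patch. Function and argument names are hypothetical, and the energies in the usage note are made-up numbers, not AMBER output.

```python
def ddE_elec(e_mut, e_wt, e_res_mut, e_res_native):
    """Equation (1): change in the folded state minus change in the
    extended reference state (all energies in the same units)."""
    return (e_mut - e_wt) - (e_res_mut - e_res_native)


def patch_average(per_residue, patch):
    """Equation (2): mean of a per-residue quantity over the residues
    of one patch. per_residue maps residue id -> value."""
    return sum(per_residue[res] for res in patch) / len(patch)
```

For example, `ddE_elec(-120.0, -100.0, -5.0, 0.0)` gives -15.0, a stabilizing (negative) change; averaging -15.0 and -5.0 over a two-residue patch gives -10.0. The same `patch_average` helper applies unchanged to conservation scores.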

Conservation 

For a given DNA-binding protein, the conservation score
$C_i$ of residue $i$ was obtained from the ConSurf-DB
database (32) or ConSurf server (41-43). The $C_i$ score is
an integer ranging from 1 (for a rapidly evolving,
highly variable residue) to 9 (for a slowly evolving,
conserved residue). The average conservation $\langle C \rangle_i$ of
the $N^{\mathrm{aa}}_{i}$ residues comprising surface patch $i$ was
computed from:

$$\langle C \rangle_i = \sum_{j} C_{j} / N^{\mathrm{aa}}_{i} \qquad (3)$$



DNA-binding residue prediction 

To determine the DNA-binding residues in a given
protein, the distinct patches were ranked according to
their $\langle \Delta\Delta E^{\mathrm{elec}} \rangle_i$ values so that the top-ranked patch
had the most favorable (most negative) $\langle \Delta\Delta E^{\mathrm{elec}} \rangle_i$,
whereas the bottom-ranked patch had the least favorable
$\langle \Delta\Delta E^{\mathrm{elec}} \rangle_i$. Among the top 10% of surface patches
ranked by $\langle \Delta\Delta E^{\mathrm{elec}} \rangle_i$, the three patches with the largest $\langle C \rangle_i$
values were selected and the constituent solvent-accessible
residues were predicted to bind DNA.
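
The selection rule described above can be sketched as follows. This is an assumed reading, not the server's implementation: the function name and arguments are hypothetical, the top 10% is rounded down (with a minimum of one patch), ties are left to the sort order, and patch averages are taken over all residues in the patch.

```python
def predict_binding_residues(patches, ddE, cons, accessible,
                             top_fraction=0.10, n_select=3):
    """Rank patches by mean electrostatic change, then by mean conservation.

    patches:    iterable of collections of residue ids.
    ddE, cons:  dicts mapping residue id -> per-residue value.
    accessible: set of solvent-accessible residue ids.
    """
    def mean(values, patch):
        return sum(values[r] for r in patch) / len(patch)

    scored = [(mean(ddE, p), mean(cons, p), frozenset(p)) for p in patches]
    scored.sort(key=lambda t: t[0])              # most negative first
    n_top = max(1, int(top_fraction * len(scored)))
    top = scored[:n_top]
    top.sort(key=lambda t: t[1], reverse=True)   # most conserved first

    predicted = set()
    for _, _, patch in top[:n_select]:
        predicted |= patch & accessible          # only accessible residues
    return predicted
```

Keeping the two criteria as successive filters, rather than combining them into one weighted score, matches the parameter-free character of the method: no relative weight between electrostatics and conservation has to be chosen.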

Performance measures 

To evaluate the performance of DR_bind, the numbers of 
correctly predicted binding residues (TP) and non-binding 
residues (TN), as well as the numbers of incorrectly pre- 
dicted binding residues (FP) and non-binding residues 
(FN) were computed and used to determine: 



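$$\mathrm{sensitivity} = \frac{TP}{TP+FN} \qquad (4)$$

$$\mathrm{specificity} = \frac{TN}{FP+TN} \qquad (5)$$

$$\mathrm{precision} = \frac{TP}{TP+FP} \qquad (6)$$

$$\mathrm{accuracy} = \frac{TP+TN}{TP+FP+TN+FN} \qquad (7)$$

and the Matthews correlation coefficient (MCC):

$$\mathrm{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\left[(TP+FP)(TP+FN)(TN+FP)(TN+FN)\right]^{1/2}} \qquad (8)$$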
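
As a quick check of Equations (4) to (8), the sketch below evaluates them for the Data set I counts reported in Table 1 (TP = 728, FP = 831, TN = 18128, FN = 1362). The function name and the dictionary layout are illustrative choices.

```python
import math


def performance(tp, fp, tn, fn):
    """Evaluate Equations (4)-(8) from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Counts for Data set I (DNA-bound structures) taken from Table 1.
measures = performance(tp=728, fp=831, tn=18128, fn=1362)
print({name: round(value, 2) for name, value in measures.items()})
```

Rounded to two decimals, these reproduce the Data set I column of Table 1 (precision 0.47, sensitivity 0.35, specificity 0.96, accuracy 0.90, MCC 0.35).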



DR_bind web server
Input 

On the DR_bind web page (http://dnasite.limlab.ibms.sinica.edu.tw/),
users are given two options. For option
A, users upload their own file in PDB format and either
supply the evolutionary data for their protein in ConSurf
format or ask DR_bind to retrieve the evolutionary data from
ConSurf. For option B, users enter the PDB code and
chain identifier; if the conservation profile for the submitted
protein structure has not been pre-calculated in
the ConSurf-DB database (32), DR_bind will attempt to
generate the ConSurf data automatically from the
ConSurf server (41-43). If no ConSurf data can be
generated, DR_bind will continue to predict DNA-binding
residues based only on the protein 3D structure
and inform the user of the missing ConSurf data on the
Results page. For multiple submissions, we have provided
a simple form that allows for nine PDB codes with chain
identifiers to be defined. After users click on the 'submit'
button, the input data are checked for consistency: residues
in the PDB file that do not correspond to the standard 20
amino acids are removed, as are multiple alternative
residue positions. If the input data pass these tests, the
prediction process is started and the user is taken to a
web page where the status and results of the job(s)
on the DR_bind server can be monitored.
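
The consistency check described above can be illustrated with a small filter over PDB-format lines. This is a sketch under stated assumptions, not the server's code: it keeps only `ATOM` records (dropping every other record type), treats a blank or `A` alternate-location indicator as the conformation to keep, and relies on the fixed PDB columns for the alternate-location flag (column 17) and residue name (columns 18-20).

```python
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def clean_pdb_lines(lines):
    """Keep ATOM records of the 20 standard amino acids, first conformer only."""
    kept = []
    for line in lines:
        if not line.startswith("ATOM") or len(line) < 20:
            continue                      # skip HETATM, TER, headers, short lines
        alt_loc = line[16]                # column 17: alternate location flag
        res_name = line[17:20]            # columns 18-20: residue name
        if res_name in STANDARD_RESIDUES and alt_loc in (" ", "A"):
            kept.append(line)
    return kept
```

A production cleaner would also have to decide what to do with modified residues and with chains that lose residues in the process; those choices are not described in the text and are left out here.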

Output 

When the DR_bind server has finished the prediction, the
results page is updated with the predicted binding
residues. If the user has provided an e-mail address, the
web server will send an e-mail to let the user know that the
prediction has been completed, with a link to the results
web page. Users can then access the results page to see the
generated prediction. As shown in Figure 1, the results
page is split into three sections: the first section has links
to downloadable files of (i) the original PDB and ConSurf
files, (ii) the 'cleaned' PDB file used by DR_bind, (iii) a
PyMOL script for highlighting the predicted DNA-binding
residues and (iv) a text file of these residues. The
second section lists the predicted DNA-binding residues.
The third section is an interactive embedded 3D representation
of the protein with the entire backbone in ribbon
format and the predicted DNA-binding residues depicted in
stick format in red. This 3D representation is created using
Jmol (http://www.jmol.org/) and can be rotated and
zoomed in/out on the results page itself.

Figure 1. An example of the Results page from DR_bind, shown for PDB entry 1TSR, chain B.



DR_bind currently runs on an Apple Mac Mini 
quad-core i7 server and the time taken to yield a predic- 
tion depends on the number of residues in the PDB chain. 
A prediction takes ~5 min for 50 residues, ~1.5 h for 200
residues, ~4.5 h for 350 residues and ~10 h for 450
residues. To handle simultaneous requests, the Torque
batch processing software is used to queue jobs. Help 
pages with instructions on how to use the server are avail- 
able at http://dnasite.limlab.ibms.sinica.edu.tw/examples/ 
help.html. 



RESULTS AND DISCUSSION 

Performance and limitations of DR_bind

In our previous work (27), we presented a method for
predicting DNA-binding sites based on electrostatics,
conservation and geometry given the respective protein
structure and tested it on a set of 56 structurally non- 
homologous proteins with DNA-bound structures, as 
well as a smaller subset of 23 proteins with both 
DNA-bound and free structures. Based on the 
DNA-free and DNA-bound protein structures, 83 and 
86% of the DNA-binding proteins have statistically sig- 
nificant DNA-binding sites, respectively. Thus, the 
method was found not to be very sensitive to protein con- 
formational changes upon DNA binding (27,44). 
However, like all structure-based prediction methods, it 
cannot predict binding residues in regions that are dis- 
ordered in the free protein structure. Another limitation 
of the method is that the predicted residues may be 
involved in binding non-DNA ligands such as RNA, 
protein, small molecules or metal ions rather than DNA 
(27,44). 

In this work, we have implemented our DNA-binding
residue prediction method as a free web server called
DR_bind, which requires as input the protein 3D
structure and yields as output experimentally testable
residues that are predicted to bind DNA. As more
DNA-binding protein structures have been solved since
validation of our method (27), and some of these may
correspond to novel folds, DR_bind was further tested
using our updated set of 83 DNA-bound and 55 bound-unbound
non-homologous protein structures, as well as
the protein-DNA benchmark version 1.2 containing 47
bound-unbound structures (28). DR_bind yielded 47%
precision, 35% sensitivity, 96% specificity, 90% accuracy
and an mcc of 0.35 in predicting DNA-binding residues using
our bound data set, and slightly lower precision (43%)
and mcc (0.33) values using our free data set (Table 1),
even though the RMSD of the DNA-free structure from
the respective DNA-bound structure may be as large as
33 Å (Supplementary Table S1). Similar trends were found
for the benchmark data set: DR_bind yielded 56% precision,
40% sensitivity, 95% specificity, 87% accuracy and
an mcc of 0.40 using the DNA-bound structures and lower
precision (49%) and mcc (0.35) values using the corresponding
free structures (Table 1). The sensitivity values are low,
as DR_bind predicts the most likely DNA-binding
residues, rather than all DNA-binding residues at the
protein-DNA interface.

Table 1. Comparison of the performance measures of DR_bind using
our non-redundant data set of 83 DNA-bound and 55 DNA-free
protein structures and the protein-DNA benchmark version 1.2
containing 47 DNA-bound and free protein structures

Data set            I (bound)   II (free)   III (bound)   III (free)
No. of structures   83          55          47            47
TP                  728         419         468           417
FP                  831         566         371           429
TN                  18128       11596       6486          6435
FN                  1362        792         702           693
Precision           0.47        0.43        0.56          0.49
Sensitivity         0.35        0.35        0.40          0.38
Specificity         0.96        0.95        0.95          0.94
Accuracy            0.90        0.90        0.87          0.86
mcc                 0.35        0.33        0.40          0.35

To assess the reliability of the performance values in 
Table 1, we randomly chose 40 of the 83 DNA-bound 
structures and 25 of the 55 DNA-free protein structures 
and computed the various performance measures; this 
procedure was repeated 1000 times in order to obtain 
the distribution of each performance measure. Figure 2a
and b illustrate the percent frequency of DR_bind's
precision values (solid lines) for the bound and free data
sets, respectively. The lower limits of precision, sensitivity,
specificity, accuracy and mcc in predicting DNA-binding
residues using DR_bind for the bound/free data sets are
0.38/0.31, 0.29/0.26, 0.94/0.93, 0.87/0.86 and 0.29/0.24,
whereas the corresponding upper limits are 0.56/0.55,
0.44/0.49, 0.97/0.97, 0.91/0.92 and 0.43/0.44. Notably,
these limits encompass the precision, sensitivity,
specificity, accuracy and mcc values obtained using the
47 bound-unbound structures from the benchmark
data set.

Figure 2. The percent frequency of a precision value derived from 1000 random choices of (a) 40 DNA-bound structures from Data set I and (b) 25
DNA-free protein structures from Data set II. The solid, dashed, dotted and dashed-dotted curves correspond to precision values obtained using
DR_bind, BindN+, NAPS and DNABINDPROT, respectively.
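
The random-subset procedure used to obtain these limits can be sketched as below. It is only an illustration: the text does not state whether precision was pooled over the sampled structures or averaged per structure, so pooling of TP and FP counts is an assumption here, as are the function name, the input layout and the fixed random seed.

```python
import random


def precision_range(per_structure_counts, sample_size, n_repeats=1000, seed=0):
    """Resample structures without replacement and report the lowest and
    highest pooled precision seen over n_repeats draws.

    per_structure_counts: list of (TP, FP) tuples, one per protein structure.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_repeats):
        sample = rng.sample(per_structure_counts, sample_size)
        tp = sum(t for t, _ in sample)
        fp = sum(f for _, f in sample)
        values.append(tp / (tp + fp))   # pooled precision of this subset
    return min(values), max(values)
```

Calling `precision_range(counts, 40)` on the 83 per-structure counts of the bound set would mimic the sampling behind Figure 2a; a histogram of the individual `values` would give the frequency curve itself.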

Comparisons with other servers that predict 
DNA-binding residues 

Using our bound and free data sets, the performance
of DR_bind was compared with that of three recent
web servers, BINDN+ (http://bioinfo.ggc.org/bindn+/),
NAPS (http://proteomics.bioengr.uic.edu/NAPS) and 
DNABINDPROT (http://www.prc.boun.edu.tr/appserv/ 
prc/dnabindprot/). BINDN+ (12) uses support vector 
machines with three biochemical features (hydrophobicity, 
side chain pKa and mass of an amino acid residue) 
incorporating evolutionary information and position- 
specific scoring matrix (PSSM). Instead of support 
vector machines, NAPS (15) employs ensemble classifiers 
based on C4.5, bootstrap aggregation and a cost-sensitive 
learning algorithm with residue charge and PSSM. 
Whereas BINDN+ and NAPS are sequence-based 
methods, DNABINDPROT (23) is a structure-based 
method that identifies high-frequency fluctuating con- 
served residues and ranks them according to their 
DNA-binding propensity. These web servers were chosen 
for comparison with DR_bind because they had been 
tested using published data sets and had been shown to 
outperform previous methods/web servers: Using the 
PDNA-62 data set, the average of sensitivity and specifi- 
city obtained by BINDN+ (78.3%) and NAPS (78.5%) 
were similar (12,15) and higher than that obtained by 
DP-Bind (76.5%) or DBS-PSSM (67.1%). Using a set of
36 DNA-binding proteins with both free and DNA-bound
structures and conservation scores, the precision obtained
by DNABINDPROT using a fast threshold of 0.1,
conservation threshold of 5, and neighboring two
residues (45.3%) was higher than that obtained by
DBD-HUNTER (44.5%), DISPLAR (40%) and
DP-Bind (33.0%) (23).

Table 2. Comparison of the performance measures of DR_bind,
BindN+, NAPS and DNABINDPROT using the same data set of 83
DNA-bound (a) or 55 DNA-free protein structures (b, c)

Server        DR_bind         BindN+          NAPS            DNABINDPROT
TP            728 (419)       1013 (542)      328 (180)       244 (169)
FP            831 (566)       1798 (1129)     733 (459)       1040 (772)
TN            18128 (11596)   17161 (11033)   18226 (11703)   17919 (11390)
FN            1362 (792)      1077 (669)      1762 (1031)     1846 (1042)
Precision     0.47 (0.43)     0.36 (0.32)     0.31 (0.28)     0.19 (0.18)
Sensitivity   0.35 (0.35)     0.48 (0.45)     0.16 (0.15)     0.12 (0.14)
Specificity   0.96 (0.95)     0.91 (0.91)     0.96 (0.96)     0.95 (0.94)
Accuracy      0.90 (0.90)     0.86 (0.87)     0.88 (0.89)     0.86 (0.86)
mcc           0.35 (0.33)     0.34 (0.31)     0.16 (0.15)     0.08 (0.09)

(a) The PDB entries are listed in Supplementary Table S1; the total
number of residues in the data set is 21049, out of which 2090
residues are DNA-binding ( = TP+FN) and 18959 residues are
non-DNA-binding ( = FP+TN).

(b) Performance measures based on the DNA-free protein structures are
in the parentheses.

(c) The PDB entries are listed in Supplementary Table S1; the total
number of residues in the data set is 13373, out of which 1211
residues are DNA-binding ( = TP+FN) and 12162 residues are
non-DNA-binding ( = FP+TN).

Using our bound and free data sets, the performance
results of all four servers are summarized in Table 2. Since
DR_bind does not aim to predict all residues at the
protein-DNA interface, its sensitivity (35%) is lower
than that of BINDN+ (45-48%), which makes almost
twice as many predictions (i.e. TP + FP). Rather
than knowing all residues that comprise the protein-DNA
interface, most biologists would be interested in
testing whether the predicted residues do indeed bind DNA
and therefore in a method's precision, which reflects the
fraction of predicted residues that are correct. Compared
with the other methods, DR_bind yields a >10% higher
precision for both data sets. To assess whether the difference in
precision between DR_bind and the other three methods is
statistically significant, we randomly chose 40 and 25
protein structures from the bound and free data sets,
respectively, and computed the precision obtained by each
of the four servers; this was repeated 1000 times. The
precision values obtained by DR_bind using the DNA-bound
(0.38-0.56) and DNA-free structures (0.31-0.55) are
generally higher than those obtained by the other three
methods, as shown in Figure 2. This is also shown by
the paired t-test, which was used to test the null hypothesis
that DR_bind does 'not' yield higher precision than the
other three methods. The resulting P < 0.00001 for
both bound and free data sets rejected the null
hypothesis (Supplementary Table S3). Hence, an
experimentalist would likely find a larger fraction of the
residues predicted by DR_bind to bind DNA compared with
those predicted by the other methods, thus saving time
and costs.
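
For readers unfamiliar with the paired test mentioned above, the sketch below computes the paired t statistic from two equally long lists of precision values, one per method, where position k in both lists refers to the same random subset. The function name is hypothetical and the numbers in the usage note are made up; converting the statistic into a one-sided P-value additionally needs the Student t distribution, for which a statistics package such as SciPy (`scipy.stats.ttest_rel`) could be used.

```python
import math
import statistics


def paired_t_statistic(values_a, values_b):
    """Return (t, degrees of freedom) for paired samples.

    A large positive t supports method A having the higher mean.
    """
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1
```

With the illustrative inputs `[0.5, 0.6, 0.7, 0.6]` and `[0.4, 0.4, 0.5, 0.3]`, the statistic is about 4.9 on 3 degrees of freedom. Pairing matters here because all methods are evaluated on the same random subsets, so subset-to-subset variation cancels in the differences.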

Compared with sequence-based methods to predict 
DNA-binding residues, the structure-based DR_bind 
approach incorporates structural information (that is, 
electrostatics and geometry) of the query protein. 
Therefore, it would be expected to perform much better 
than sequence-based methods when evolutionary informa- 
tion for a query protein is not available. To show the im- 
portance of additional structural information, we tested 
the structure- and sequence-based methods on a set of 
15 non-redundant DNA-bound protein structures 
with no or unreliable ConSurf conservation profiles. 
Note that DNABINDPROT could not be applied to this set
of 'unique' DNA-binding proteins because it does 
not yield predictions for proteins without ConSurf-DB 
conservation data. The performance results of DR_bind, 
BINDN+ and NAPS in Table 3 show that the difference 
in performance between DR_bind and the
two sequence-based methods becomes more apparent for
proteins without conservation data: the precision 
of DR_bind (47%) is nearly twice that of BINDN+ 
(27%) and NAPS (23%). Thus, for DNA-binding 
proteins with no or insufficient homologs, DR_bind 
could provide a significantly higher fraction of correctly 
predicted DNA-binding residues than sequence-based 
methods. 
Table 3. Comparison of the performance measures of DR_bind,
BindN+ and NAPS using the same data set of 15 DNA-bound
protein structures with no or insufficient close homologs (a)

Server        DR_bind   BindN+   NAPS
TP            110       230      34
FP            122       618      115
TN            2585      2089     2592
FN            292       172      368
Precision     0.47      0.27     0.23
Sensitivity   0.27      0.57     0.08
Specificity   0.95      0.77     0.96
Accuracy      0.87      0.75     0.84
mcc           0.29      0.26     0.07

(a) The PDB entries are listed in Supplementary Table S2; the total
number of residues in the data set is 3109, out of which 402 residues
are DNA-binding ( = TP+FN) and 2707 residues are non-DNA-binding
( = FP+TN).